29 research outputs found

    Gazelle: A Low Latency Framework for Secure Neural Network Inference

    The growing popularity of cloud-based machine learning raises a natural question about the privacy guarantees that can be provided in such a setting. Our work tackles this problem in the context where a client wishes to classify private images using a convolutional neural network (CNN) trained by a server. Our goal is to build efficient protocols whereby the client can acquire the classification result without revealing their input to the server, while guaranteeing the privacy of the server's neural network. To this end, we design Gazelle, a scalable and low-latency system for secure neural network inference, using an intricate combination of homomorphic encryption and traditional two-party computation techniques (such as garbled circuits). Gazelle makes three contributions. First, we design the Gazelle homomorphic encryption library which provides fast algorithms for basic homomorphic operations such as SIMD (single instruction multiple data) addition, SIMD multiplication and ciphertext permutation. Second, we implement the Gazelle homomorphic linear algebra kernels which map neural network layers to optimized homomorphic matrix-vector multiplication and convolution routines. Third, we design optimized encryption switching protocols which seamlessly convert between homomorphic and garbled circuit encodings to enable implementation of complete neural network inference. We evaluate our protocols on benchmark neural networks trained on the MNIST and CIFAR-10 datasets and show that Gazelle outperforms the best existing systems such as MiniONN (ACM CCS 2017) by 20 times and Chameleon (Crypto Eprint 2017/1164) by 30 times in online runtime. Similarly, when compared with fully homomorphic approaches like CryptoNets (ICML 2016), we demonstrate three orders of magnitude faster online runtime.
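
    The abstract names three slot-wise primitives (SIMD addition, SIMD multiplication, ciphertext permutation) and says that matrix-vector products are built from them. The sketch below simulates this on plaintext lists using the classic "diagonal" method. This is an illustrative assumption: it is not necessarily the kernel Gazelle implements, it involves no encryption, and the helper names are hypothetical.

```python
# Plaintext simulation of a matrix-vector product built only from the three
# slot-wise operations named in the abstract. No encryption is performed;
# each list stands in for the slots of one packed ciphertext or plaintext.
# All function names are hypothetical and chosen for this sketch.

def rotate(v, k):
    """Stand-in for ciphertext permutation: rotate slots left by k."""
    k %= len(v)
    return v[k:] + v[:k]

def simd_add(a, b):
    """Slot-wise addition."""
    return [x + y for x, y in zip(a, b)]

def simd_mul(a, b):
    """Slot-wise multiplication."""
    return [x * y for x, y in zip(a, b)]

def matvec_diagonal(M, v):
    """Compute M @ v for a square matrix M using the diagonal method."""
    n = len(v)
    if any(len(row) != n for row in M) or len(M) != n:
        raise ValueError("this sketch assumes a square n x n matrix")
    acc = [0] * n
    for i in range(n):
        # i-th generalized diagonal of M (known in the clear to the server).
        diag = [M[j][(j + i) % n] for j in range(n)]
        # One rotation, one SIMD multiply and one SIMD add per diagonal.
        acc = simd_add(acc, simd_mul(diag, rotate(v, i)))
    return acc
```

    For an n x n matrix this uses n rotations, n slot-wise multiplications and n slot-wise additions, which is why the cost of the permutation primitive matters for the linear layers.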

    An Energy-Efficient Reconfigurable DTLS Cryptographic Engine for End-to-End Security in IoT Applications

    This paper presents a reconfigurable cryptographic engine that implements the DTLS protocol to enable end-to-end security for IoT. This implementation of the DTLS engine demonstrates 10x reduction in code size and 438x improvement in energy-efficiency over software. Our ECC primitive is 237x and 9x more energy-efficient compared to software and state-of-the-art hardware respectively. Pairing the DTLS engine with an on-chip RISC-V allows us to demonstrate applications beyond DTLS with up to 2 orders of magnitude energy savings. Comment: Published in 2018 IEEE International Solid-State Circuits Conference (ISSCC).

    Fast Vector Oblivious Linear Evaluation from Ring Learning with Errors

    Oblivious linear evaluation (OLE) is a fundamental building block in multi-party computation protocols. In OLE, a sender holds a description of an affine function $f_{\alpha,\beta}(z)=\alpha z+\beta$, the receiver holds an input $x$, and gets $\alpha x+\beta$ (where all computations are done over some field, or more generally, a ring). Vector OLE (VOLE) is a generalization where the sender has many affine functions and the receiver learns the evaluation of all of these functions on a single point $x$. The state-of-the-art semi-honest VOLE protocols generally fall into two groups. The first group relies on standard assumptions to achieve security but lacks in concrete efficiency. These constructions are mostly based on additively homomorphic encryption (AHE) and are classified as "folklore". The second group relies on less standard assumptions, usually properties of sparse, random linear codes, but they manage to achieve concrete practical efficiency. In this work, we present a conceptually simple VOLE protocol that derives its security from a standard assumption, namely Ring Learning with Errors (RLWE), while still achieving concrete efficiency comparable to the fastest VOLE protocols from non-standard coding assumptions. Furthermore, our protocol admits a natural extension to batch OLE (BOLE), which is yet another variant of OLE that computes many OLEs in parallel.
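
    The three functionalities defined in the abstract (OLE, VOLE, BOLE) can be written down as the outputs a trusted party would hand to the receiver. The sketch below does only that: it is not a secure protocol and has nothing to do with the RLWE construction. The prime modulus and the function names are assumptions made for illustration.

```python
# Ideal (trusted-party) functionalities for OLE, vector OLE and batch OLE.
# This shows WHAT the receiver learns, not HOW a protocol hides the inputs.
# The modulus P is an illustrative choice, not a parameter from the paper.
P = 2**61 - 1  # a Mersenne prime, so arithmetic mod P forms a field

def ole(alpha, beta, x, p=P):
    """Receiver's output for the sender's affine function f(z) = alpha*z + beta."""
    return (alpha * x + beta) % p

def vole(alphas, betas, x, p=P):
    """Vector OLE: many affine functions, all evaluated at one receiver point x."""
    if len(alphas) != len(betas):
        raise ValueError("need one beta per alpha")
    return [ole(a, b, x, p) for a, b in zip(alphas, betas)]

def bole(alphas, betas, xs, p=P):
    """Batch OLE: independent OLE instances, each with its own receiver input."""
    if not (len(alphas) == len(betas) == len(xs)):
        raise ValueError("need one (alpha, beta, x) triple per instance")
    return [ole(a, b, x, p) for a, b, x in zip(alphas, betas, xs)]
```

    The only difference between `vole` and `bole` in this sketch is whether the receiver contributes a single point or one point per instance, which mirrors the distinction drawn in the abstract.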

    A 249-Mpixel/s HEVC Video-Decoder Chip for 4K Ultra-HD Applications

    High Efficiency Video Coding, the latest video standard, uses larger and variable-sized coding units and longer interpolation filters than H.264/AVC to better exploit redundancy in video signals. These algorithmic techniques enable a 50% decrease in bitrate at the cost of computational complexity, external memory bandwidth, and, for ASIC implementations, on-chip SRAM of the video codec. This paper describes architectural optimizations for an HEVC video decoder chip. The chip uses a two-stage subpipelining scheme to reduce on-chip SRAM by 56 kbytes, a 32% reduction. A high-throughput read-only cache combined with DRAM-latency-aware memory mapping reduces DRAM bandwidth by 67%. The chip is built for HEVC Working Draft 4 Low Complexity configuration and occupies 1.77 mm² in 40-nm CMOS. It performs 4K Ultra HD 30-fps video decoding at 200 MHz while consuming 1.19 nJ/pixel of normalized system power. Texas Instruments Incorporated
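
    The 249 Mpixel/s figure in the title follows from the frame rate and clock quoted in the abstract once a frame size is fixed. The short calculation below assumes a 3840 x 2160 frame for 4K Ultra HD; the abstract above does not state the resolution, so that value is an assumption, as are the variable names.

```python
# Back-of-the-envelope check of the decoder's pixel throughput.
# WIDTH and HEIGHT are an assumed 4K Ultra HD frame size; FPS and CLOCK_HZ
# are the figures quoted in the abstract.
WIDTH, HEIGHT = 3840, 2160
FPS = 30
CLOCK_HZ = 200e6

def pixel_rate(width, height, fps):
    """Pixels that must be decoded per second."""
    return width * height * fps

rate = pixel_rate(WIDTH, HEIGHT, FPS)   # 248,832,000 pixels per second
pixels_per_cycle = rate / CLOCK_HZ      # about 1.24 pixels per clock cycle

print(f"{rate / 1e6:.1f} Mpixel/s, {pixels_per_cycle:.2f} pixels/cycle")
```

    Under these assumptions the chip must sustain a little more than one decoded pixel per clock cycle on average.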

    A 249Mpixel/s HEVC video-decoder chip for Quad Full HD applications

    The latest video coding standard High Efficiency Video Coding (HEVC) provides 50% improvement in coding efficiency compared to H.264/AVC, to meet the rising demand for video streaming, better video quality and higher resolutions. The coding gain is achieved using more complex tools such as larger and variable-size coding units (CU) in a hierarchical structure, larger transforms and longer interpolation filters. This paper presents an integrated circuit which supports Quad Full HD (QFHD, 3840×2160) video decoding for the HEVC draft standard. It addresses new design challenges for HEVC (“H.265”) with three primary contributions: 1) a system pipelining scheme which adapts to the variable-size largest coding unit (LCU) and provides a two-stage sub-pipeline for memory optimization; 2) unified processing engines to address the hierarchical coding structure and many prediction and transform block sizes in area-efficient ways; 3) a motion compensation (MC) cache which reduces DRAM bandwidth for the LCU and meets the high throughput requirements which are due to the long filters. Texas Instruments Incorporated

    An Energy-Efficient Reconfigurable DTLS Cryptographic Engine for Securing Internet-of-Things Applications

    This paper presents the first hardware implementation of the Datagram Transport Layer Security (DTLS) protocol to enable end-to-end security for the Internet of Things (IoT). A key component of this design is a reconfigurable prime field elliptic curve cryptography (ECC) accelerator, which is 238x and 9x more energy-efficient compared to software and state-of-the-art hardware respectively. Our full hardware implementation of the DTLS 1.3 protocol provides 438x improvement in energy-efficiency over software, along with code size and data memory usage as low as 8 KB and 3 KB respectively. The cryptographic accelerators are coupled with an on-chip low-power RISC-V processor to benchmark applications beyond DTLS with up to two orders of magnitude energy savings. The test chip, fabricated in 65 nm CMOS, demonstrates hardware-accelerated DTLS sessions while consuming 44.08 μJ per handshake, and 0.89 nJ per byte of encrypted data at 16 MHz and 0.8 V. Comment: Published in IEEE Journal of Solid-State Circuits (JSSC).
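
    The two energy figures at the end of the abstract can be combined into a rough per-session estimate. The sketch below assumes a simple linear model (one handshake plus a per-byte cost) and ignores radio, idle and processor energy. Neither the model nor the function names come from the paper.

```python
# Rough DTLS session energy from the two figures quoted in the abstract.
# Assumed model: exactly one handshake per session, then a constant
# per-byte encryption cost. Everything else (radio, idle power) is ignored.
HANDSHAKE_J = 44.08e-6   # 44.08 microjoules per handshake
PER_BYTE_J = 0.89e-9     # 0.89 nanojoules per encrypted byte

def session_energy_joules(payload_bytes):
    """Energy for one handshake plus payload_bytes of encrypted data."""
    if payload_bytes < 0:
        raise ValueError("payload_bytes must be non-negative")
    return HANDSHAKE_J + PER_BYTE_J * payload_bytes

def break_even_bytes():
    """Payload size at which encryption energy equals handshake energy."""
    return HANDSHAKE_J / PER_BYTE_J

print(f"1 KiB session: {session_energy_joules(1024) * 1e6:.2f} uJ")
print(f"break-even payload: {break_even_bytes():.0f} bytes")
```

    Under this model the handshake dominates until the payload reaches roughly 50 kB, so short IoT sessions are mostly paying for key establishment rather than bulk encryption.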

    Does Fully Homomorphic Encryption Need Compute Acceleration?

    The emergence of cloud-computing has raised important privacy questions about the data that users share with remote servers. While data in transit is protected using standard techniques like Transport Layer Security (TLS), most cloud providers have unrestricted plaintext access to user data at the endpoint. Fully Homomorphic Encryption (FHE) offers one solution to this problem by allowing for arbitrarily complex computations on encrypted data without ever needing to decrypt it. Unfortunately, all known implementations of FHE require the addition of noise during encryption which grows during computation. As a result, sustaining deep computations requires a periodic noise reduction step known as bootstrapping. The cost of the bootstrapping operation is one of the primary barriers to the wide-spread adoption of FHE. In this paper, we present an in-depth architectural analysis of the bootstrapping step in FHE. First, we observe that secure implementations of bootstrapping exhibit a low arithmetic intensity (<1 Op/byte), require large caches (>100 MB) and as such, are heavily bound by the main memory bandwidth. Consequently, we demonstrate that existing workloads observe marginal performance gains from the design of bespoke high-throughput arithmetic units tailored to FHE. Secondly, we propose several cache-friendly algorithmic optimizations that improve the throughput in FHE bootstrapping by enabling up to 3.2× higher arithmetic intensity and 4.6× lower memory bandwidth. Our optimizations apply to a wide range of structurally similar computations such as private evaluation and training of machine learning models. Finally, we incorporate these optimizations into an architectural tool which, given a cache size, memory subsystem, the number of functional units and a desired security level, selects optimal cryptosystem parameters to maximize the bootstrapping throughput. Our optimized bootstrapping implementation represents a best-case scenario for compute acceleration of FHE. We show that despite these optimizations, bootstrapping (as well as other applications with similar computational structure) remains bottlenecked by main memory bandwidth. We thus conclude that secure FHE implementations need to look beyond accelerated compute for further performance improvements and to that end, we propose new research directions to address the underlying memory bottleneck. In summary, our answer to the titular question is: yes, but only after addressing the memory bottleneck.
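
    The argument about being memory-bound can be illustrated with the standard roofline model, in which attainable throughput is the smaller of peak compute and arithmetic intensity times memory bandwidth. The sketch below is not the paper's architectural tool. The machine parameters and the baseline intensity are hypothetical values chosen only to show the shape of the argument.

```python
# Roofline-style estimate of attainable throughput.
# PEAK_OPS, MEM_BW and BASELINE_AI are hypothetical numbers, not measurements
# from the paper; only the 3.2x improvement factor is taken from the abstract.
PEAK_OPS = 1e12      # assumed peak compute, operations per second
MEM_BW = 100e9       # assumed main-memory bandwidth, bytes per second
BASELINE_AI = 0.5    # assumed arithmetic intensity, operations per byte

def attainable_ops(intensity, peak_ops=PEAK_OPS, mem_bw=MEM_BW):
    """Roofline bound: limited by compute or by memory traffic, whichever is lower."""
    return min(peak_ops, intensity * mem_bw)

def is_memory_bound(intensity, peak_ops=PEAK_OPS, mem_bw=MEM_BW):
    """True when the intensity lies below the ridge point peak_ops / mem_bw."""
    return intensity * mem_bw < peak_ops

for label, ai in [("baseline", BASELINE_AI), ("3.2x higher intensity", 3.2 * BASELINE_AI)]:
    bound = "memory" if is_memory_bound(ai) else "compute"
    print(f"{label}: {attainable_ops(ai) / 1e9:.0f} Gop/s, {bound}-bound")
```

    With these illustrative numbers the ridge point sits at 10 operations per byte, so even a 3.2× gain in intensity leaves the workload memory-bound, and adding faster arithmetic units would not raise the attainable throughput.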